Skip to content

Generating MFC Images and Testing Them on OSPool #935

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 46 commits into
base: master
Choose a base branch
from

Conversation

Malmahrouqi3
Copy link
Collaborator

@Malmahrouqi3 Malmahrouqi3 commented Jul 11, 2025

User description

Description

Concerning (#654),
Generating four images CPU, CPU_Benchmark, GPU, and GPU_Benchmark. All MFC builds occur on a GitHub runner, while testing and storing latest images take place on OSPOOL. They are retrievable on the CI itself as the images are pre-built MFC with pre-installed packages that can be accessed with simple commands.

Debugging info,
To locally generate images, apptainer build mfc_cpu.sif Singularity.cpu
To start shell instance, apptainer shell --fakeroot --writable-tmpfs mfc_cpu.sif
To execute directly specific commands, apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'

To-dos,

  • Proper packages and base container for each recipe.
  • htcondor test script to request specific allocations per image.
  • Sanity-check by using the images on various resources/clusters.
  • Maintainer triggered if needed, otherwise most recent images will only be hosted.

Note to Self: current secrets are hosted in the fork, and prior to merge new dedicated ones should be added to the base repo. To do so, request access point under "GATech_Bryngelson" project, then upload public SSH key to https://registry.cilogon.org/. Later on, update secrets which include private SSH key and user@host.

Ref's
NVIDIA Container


PR Type

Other


Description

  • Remove existing CI workflows and testing infrastructure

  • Add Singularity container image building workflow

  • Create four container definitions for CPU/GPU variants

  • Implement automated image building and testing on OSPool


Changes diagram

flowchart LR
  A["Old CI Workflows"] -- "removed" --> B["Deleted Files"]
  C["New Container Workflow"] -- "builds" --> D["Singularity Images"]
  D -- "stores on" --> E["OSPool"]
  F["Container Definitions"] -- "defines" --> G["CPU/GPU Variants"]
Loading

Changes walkthrough 📝

Relevant files
Miscellaneous
17 files
build.sh
Remove Frontier build script                                                         
+0/-9     
submit.sh
Remove Frontier job submission script                                       
+0/-56   
test.sh
Remove Frontier test script                                                           
+0/-10   
bench.sh
Remove Phoenix benchmark script                                                   
+0/-20   
submit-bench.sh
Remove Phoenix benchmark submission script                             
+0/-64   
submit.sh
Remove Phoenix job submission script                                         
+0/-64   
test.sh
Remove Phoenix test script                                                             
+0/-21   
bench.yml
Remove benchmark workflow                                                               
+0/-68   
cleanliness.yml
Remove code cleanliness workflow                                                 
+0/-127 
coverage.yml
Remove coverage check workflow                                                     
+0/-48   
docs.yml
Remove documentation workflow                                                       
+0/-76   
formatting.yml
Remove formatting check workflow                                                 
+0/-19   
line-count.yml
Remove line count workflow                                                             
+0/-54   
lint-source.yml
Remove source linting workflow                                                     
+0/-51   
lint-toolchain.yml
Remove toolchain linting workflow                                               
+0/-17   
spelling.yml
Remove spell check workflow                                                           
+0/-17   
test.yml
Remove main test suite workflow                                                   
+0/-131 
Enhancement
5 files
container-image.yml
Add Singularity image building workflow                                   
+63/-0   
Singularity.cpu
Add CPU container definition                                                         
+24/-0   
Singularity.cpu_bench
Add CPU benchmark container definition                                     
+27/-0   
Singularity.gpu
Add GPU container definition                                                         
+34/-0   
Singularity.gpu_bench
Add GPU benchmark container definition                                     
+32/-0   

Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • @Copilot Copilot AI review requested due to automatic review settings July 11, 2025 15:46
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    654 - Partially compliant

    Compliant requirements:

    • Enable MFC usage when dependencies don't exist on the target node

    Non-compliant requirements:

    • Provide instructions for building MFC on nodes without internet access
    • Address systems with classified or sensitive information restrictions
    • Provide at least rough guidance for offline MFC builds

    Requires further human verification:

    • Verification that pre-built containers can actually be used on offline systems
    • Testing container functionality on systems with network restrictions

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    Sensitive information exposure:
    The workflow uses SSH private keys stored in GitHub secrets and connects to external systems. While using secrets is appropriate, the SSH connection setup and key handling should be reviewed to ensure proper security practices are followed.

    ⚡ Recommended focus areas for review

    Duplicate Code

    SSH setup code is duplicated between Build & Store Images and Test Images steps. The SSH key setup, known_hosts configuration, and file permissions are repeated unnecessarily.

        mkdir -p ~/.ssh
        echo "${{secrets.SSH_PRIVATE_KEY}}" >~/.ssh/id_rsa
        chmod 600 ~/.ssh/id_rsa
        ssh-keyscan -H ap40.uw.osg-htc.org >> ~/.ssh/known_hosts
    
        (cd pr/.github/workflows/images && sudo apptainer build mfc_cpu.sif Singularity.cpu)
        scp pr/.github/workflows/images/mfc_cpu.sif ${{secrets.SSH_USER}}:MFC/mfc_cpu.sif
        rm -rf pr/.github/workflows/images/mfc_cpu.sif
    
        (cd pr/.github/workflows/images && sudo apptainer build mfc_cpu_bench.sif Singularity.cpu_bench)
        scp pr/.github/workflows/images/mfc_cpu_bench.sif ${{secrets.SSH_USER}}:MFC/mfc_cpu_bench.sif
        rm -rf pr/.github/workflows/images/mfc_cpu_bench.sif
    
        (cd pr/.github/workflows/images && sudo apptainer build mfc_gpu.sif Singularity.gpu)
        scp pr/.github/workflows/images/mfc_gpu.sif ${{secrets.SSH_USER}}:MFC/mfc_gpu.sif
        rm -rf pr/.github/workflows/images/mfc_gpu.sif
    
        (cd pr/.github/workflows/images && sudo apptainer build mfc_gpu_bench.sif Singularity.gpu_bench)
        scp pr/.github/workflows/images/mfc_gpu_bench.sif ${{secrets.SSH_USER}}:MFC/mfc_gpu_bench.sif
        rm -rf pr/.github/workflows/images/mfc_gpu_bench.sif
    
    - name: Test Images
      run: |
        mkdir -p ~/.ssh
        echo "${{secrets.SSH_PRIVATE_KEY}}" >~/.ssh/id_rsa
        chmod 600 ~/.ssh/id_rsa
        ssh-keyscan -H ap40.uw.osg-htc.org >> ~/.ssh/known_hosts
    Possible Issue

    The Test Images step attempts to copy mfc_cpu.sif again after it was already copied and removed in the previous step. The test commands reference container files that may not exist locally on the runner.

    scp pr/.github/workflows/images/mfc_cpu.sif ${{secrets.SSH_USER}}:MFC/mfc_cpu.sif
    ssh ${{secrets.SSH_USER}} ""
    apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
    apptainer exec --fakeroot --writable-tmpfs mfc_cpu_bench.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
    apptainer exec --fakeroot --writable-tmpfs mfc_gpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
    apptainer exec --fakeroot --writable-tmpfs mfc_gpu_bench.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
    "
    Duplicate Code

    The Singularity container definitions share nearly identical package installation and build steps across all four variants, with only minor differences in compiler settings and runscript behavior.

    export DEBIAN_FRONTEND=noninteractive
    apt update -y && apt install -y \
        build-essential git tar wget make cmake gcc g++ \
        python3 python3-dev python3-venv \
        openmpi-bin libopenmpi-dev libfftw3-dev \
        python3-pip python3-venv
    cd /opt
    git clone --depth 1 https://github.com/mflowcode/mfc.git MFC
    cd /opt/MFC
    ./mfc.sh build -j $(nproc)
    ./mfc.sh test -a --dry-run -j $(nproc)

    Copy link
    Contributor

    @Copilot Copilot AI left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Pull Request Overview

    This PR replaces the existing CI workflows with a single pipeline that builds, deploys, and tests four Singularity images (CPU, CPU_Benchmark, GPU, GPU_Benchmark) on OSPOOL.

    • Removes legacy workflows for testing, linting, coverage, formatting, docs, benches, and cleanliness checks.
    • Introduces .github/workflows/container-image.yml to build images via Apptainer, SCP them to the remote host, and then execute tests inside each image.
    • Adds four Singularity definition files under .github/workflows/images/ to specify the CPU/GPU and benchmarking images.

    Reviewed Changes

    Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

    Show a summary per file
    File Description
    .github/workflows/container-image.yml New workflow to build, store, and test Singularity images
    .github/workflows/images/Singularity.cpu Definition for the CPU image
    .github/workflows/images/Singularity.cpu_bench Definition for the CPU benchmark image
    .github/workflows/images/Singularity.gpu Definition for the GPU image
    .github/workflows/images/Singularity.gpu_bench Definition for the GPU benchmark image
    .github/workflows/test.yml Removed legacy test suite workflow
    .github/workflows/spelling.yml Removed spelling check workflow
    .github/workflows/lint-toolchain.yml Removed toolchain lint workflow
    .github/workflows/lint-source.yml Removed source lint workflow
    .github/workflows/formatting.yml Removed code formatting workflow
    .github/workflows/docs.yml Removed documentation build & publish workflow
    .github/workflows/coverage.yml Removed coverage check workflow
    .github/workflows/cleanliness.yml Removed cleanliness check workflow
    .github/workflows/line-count.yml Removed lines-of-code diff workflow
    .github/workflows/bench.yml Removed benchmark comparison workflow
    Comments suppressed due to low confidence (4)

    .github/workflows/container-image.yml:56

    • There’s an extraneous double-quote on this line, which will cause a YAML syntax error. Please remove it to ensure the run block is parsed correctly.
              "
    

    .github/workflows/container-image.yml:51

    • This ssh ... "" command is a no-op and is likely unintended. Either remove it or replace it with the actual remote command you want to run.
              ssh ${{secrets.SSH_USER}} ""
    

    .github/workflows/container-image.yml:28

    • You remove each .sif immediately after SCP in the Build step, so the Test Images step won’t find any local .sif files. Consider delaying the cleanup or reordering the steps so tests still run on the generated images.
              (cd pr/.github/workflows/images && sudo apptainer build mfc_cpu.sif Singularity.cpu)
    

    Copy link

    qodo-merge-pro bot commented Jul 11, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix SCP hostname specification

    The SSH connection string is missing the hostname portion. The format should be
    user@hostname:path but currently only has user:path, which will cause the SCP
    command to fail.

    .github/workflows/container-image.yml [29]

    -scp pr/.github/workflows/images/mfc_cpu.sif ${{secrets.SSH_USER}}:MFC/mfc_cpu.sif
    +scp pr/.github/workflows/images/mfc_cpu.sif ${{secrets.SSH_USER}}@ap40.uw.osg-htc.org:MFC/mfc_cpu.sif
    Suggestion importance[1-10]: 10

    __

    Why: The suggestion correctly identifies that the scp command is missing the hostname, which is a critical bug that would prevent file transfer and cause the workflow to fail.

    High
    Fix SSH command syntax error

    The SSH command has an empty command string and the multi-line string is
    improperly terminated. The empty SSH command serves no purpose and the closing
    quote should be removed to fix the syntax error.

    .github/workflows/container-image.yml [51-56]

    -ssh ${{secrets.SSH_USER}} ""
     apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
     apptainer exec --fakeroot --writable-tmpfs mfc_cpu_bench.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
     apptainer exec --fakeroot --writable-tmpfs mfc_gpu.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
     apptainer exec --fakeroot --writable-tmpfs mfc_gpu_bench.sif /bin/bash -c 'cd /opt/MFC && ./mfc.sh test -a'
    -"
    Suggestion importance[1-10]: 9

    __

    Why: The suggestion correctly identifies a syntax error due to a stray " which would cause the workflow step to fail, making this a critical fix.

    High
    General
    Remove redundant file upload

    The workflow uploads the same image file twice to the same location, which is
    redundant and may cause race conditions. The second SCP command in the test step
    should be removed since the file was already uploaded in the build step.

    .github/workflows/container-image.yml [50]

    -scp pr/.github/workflows/images/mfc_cpu.sif ${{secrets.SSH_USER}}:MFC/mfc_cpu.sif
    +# Remove this line as the file was already uploaded in the build step
    Suggestion importance[1-10]: 9

    __

    Why: The suggestion correctly identifies that the scp command on line 50 is attempting to copy a file that was deleted in a previous step, which would cause the workflow to fail.

    High
    • Update

    @Malmahrouqi3
    Copy link
    Collaborator Author

    I removed all workflow files and will restore them once done, as they are not implicated with #654 whatsoever so far.

    @sbryngelson
    Copy link
    Member

    I removed all workflow files and will restore them once done, as they are not implicated with #654 whatsoever so far.

    wonderful thank you

    @Malmahrouqi3
    Copy link
    Collaborator Author

    Malmahrouqi3 commented Jul 11, 2025

    As of right now, I relied solely on ./mfc.sh test -a in all HTCondor job to ensure proper functionality of images and I will look into potential failure modes. CPU(GPU)_Benchmark images are identical to the standard version until given further instructions on the dedicated use of them. I was thinking of passing them as well onto Phoenix as another testbed since retrieving the images would take few moments avoiding the overhead of a full MFC build and compile.

    Requested allocated resources are quite excessive for now and will be optimized later on to not get stuck in the queue forever.

    @sbryngelson
    Copy link
    Member

    Grab the new workflow files from master and you can start doing CI again. You may need to merge in any changes you made. lmk if you have questions.

    @Malmahrouqi3
    Copy link
    Collaborator Author

    Malmahrouqi3 commented Jul 13, 2025

    Status Update: I faced a hurdle with ssh connectivity whether using SSH Keys (public/private) or Credentials (ssh user@host). Out of nowhere, the Access Point would deny access arbitrarily. If the issue persists, I will contact OSPool support. In the meantime, I will improve batch job requirements.

    Host key verification failed.
    scp: Connection closed
    Process completed with exit code 255.
    

    Edit: I am going to inquire on how to ensure each job instance occurs on a distinct cluster i.e. 5-10 instances of a single job would run on 5-10 unique clusters increasing failure potentials.

    @sbryngelson sbryngelson marked this pull request as draft July 14, 2025 07:04
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Development

    Successfully merging this pull request may close these issues.

    2 participants